Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

RNA-Seq Data Analysis ◾ 173

overlap with multiple features (non-unique). A read that maps or overlaps with multiple

features is considered as ambiguous and will not be counted. Only reads mapping unam-

biguously to a single feature are counted.

For RNA-Seq read counts, we can use a variety of programs but the most used free open-

source programs include HTSeq-count [19] (Python-based program) and FeatureCounts

[20], which is used as a Linux command-line program or as a function in the R package

Rsubread. Both these programs require BAM files and GFT file (reference annotation file)

as inputs.

In the following, we will use HTSeq-count to count RNA-seq reads aligned to the genes

in chromosome 22. Install HTSeq-count by following the installation instructions avail-

able at “https://htseq.readthedocs.io/en/master/install.html”. The following HTSeq-count

command counts aligned reads in all BAM files stored in the “bam” subdirectory and saves

the output in “features/htcount.txt”. The “-m union” option specifies the union mode to

handle reads overlapping more than one feature. We used “--additional-attr=transcript_

id” to add transcript accession numbers to the output.

mkdir features

htseq-count \

-m union \

-f bam \

--additional-attr=transcript_id \

-s yes bam/*.bam \

gtf/hg38.ncbiRefSeq.gtf \

> features/htcount.txt

sed ‘/^__/ d’ < htcount.txt > htcount2.txt

The “sed” command has been used to remove the last rows that begin with “__” and to save

that change in a new file “htcount2.txt”, which we will use in the next step of the analysis.

FIGURE 5.3 HTSeq-count feature count.